Scaling up top-K cosine similarity search
Authors
Abstract
Article history: Received 21 September 2009; received in revised form 23 August 2010; accepted 23 August 2010; available online 8 September 2010.
Recent years have witnessed an increased interest in computing cosine similarity in many application domains. Most previous studies require the specification of a minimum similarity threshold to perform the cosine similarity computation. However, it is usually difficult for users to provide an appropriate threshold in practice. Instead, in this paper, we propose to search for the top-K strongly correlated pairs of objects as measured by the cosine similarity. Specifically, we first identify the monotone property of an upper bound of the cosine measure and exploit a diagonal traversal strategy to develop the TOP-DATA algorithm. In addition, we observe that a diagonal traversal strategy usually leads to higher I/O costs. Therefore, we develop a max-first traversal strategy and propose the TOP-MATA algorithm. A theoretical analysis shows that TOP-MATA saves the computations for false-positive item pairs and can significantly reduce I/O costs. Finally, experimental results demonstrate the computational efficiency of both the TOP-DATA and TOP-MATA algorithms. We also show that TOP-MATA is particularly scalable for large-scale data sets with a large number of items.
© 2010 Elsevier B.V. All rights reserved.
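The upper-bound idea in the abstract can be illustrated with a minimal sketch. It assumes binary transaction data, where cos(A, B) = supp(AB) / sqrt(supp(A) supp(B)), and it uses the bound sqrt(supp(B) / supp(A)) for supp(A) >= supp(B), which follows from supp(AB) <= supp(B). The function name `top_k_cosine_pairs`, the input layout, and the stopping rule are assumptions made for illustration; this is not the authors' TOP-DATA or TOP-MATA implementation.

```python
import heapq
from math import sqrt


def top_k_cosine_pairs(transactions, k):
    """Return up to k item pairs with the highest cosine, best first.

    Hypothetical helper: items are assumed to be strings and each
    transaction an iterable of items.
    """
    if k <= 0:
        return []

    # Map each item to the set of transaction ids that contain it.
    tidsets = {}
    for tid, items_in_t in enumerate(transactions):
        for item in set(items_in_t):
            tidsets.setdefault(item, set()).add(tid)

    # Sort items by support, highest first, so that for i < j the bound
    # sqrt(supp[j] / supp[i]) can only shrink as j moves away from i.
    items = sorted(tidsets, key=lambda it: (-len(tidsets[it]), it))
    supp = [len(tidsets[it]) for it in items]
    n = len(items)

    heap = []  # min-heap of (cosine, item_a, item_b), at most k entries
    for d in range(1, n):  # walk the d-th diagonal of the pair matrix
        any_candidate = False
        for i in range(n - d):
            j = i + d
            upper = sqrt(supp[j] / supp[i])
            # Skip the exact computation when the bound cannot beat
            # the current k-th best cosine.
            if len(heap) == k and upper <= heap[0][0]:
                continue
            any_candidate = True
            joint = len(tidsets[items[i]] & tidsets[items[j]])
            cos = joint / sqrt(supp[i] * supp[j])
            if len(heap) < k:
                heapq.heappush(heap, (cos, items[i], items[j]))
            elif cos > heap[0][0]:
                heapq.heapreplace(heap, (cos, items[i], items[j]))
        # Bounds on the next diagonal are no larger than on this one,
        # so a fully pruned diagonal means the search can stop.
        if not any_candidate:
            break
    return sorted(heap, reverse=True)
```

Calling `top_k_cosine_pairs(transactions, 2)` on a list of item lists returns `(cosine, item_a, item_b)` tuples. Pairs that tie with the K-th best cosine may be dropped, which still yields a valid top-K set.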
Similar resources
Scaling up all pairs similarity search pdf
Given a large collection of sparse vector data in a high dimensional space, we investigate the problem of finding all pairs of vectors whose similarity. Scaling up all pairs similarity search, Published by ACM. The problem of finding a...
Cosine Similarity Search with Multi Index Hashing
Due to the rapid development of the Internet, recent years have witnessed an explosion in the rate of data generation. Dealing with data at current scales brings up unprecedented challenges. From the algorithmic viewpoint, executing existing linear algorithms in information retrieval and machine learning on such tremendous amounts of data incurs intolerable computational and storage costs. To addre...
Dynamic Multi-keyword Top-k Ranked Search over Encrypted Cloud Data
Nowadays, more and more people are motivated to outsource their local data to public cloud servers for great convenience and reduced costs in data management. But in consideration of privacy issues, sensitive data should be encrypted before outsourcing, which obsoletes traditional data utilization like keyword-based document retrieval. In this paper, we present a secure and efficient multi-keyw...
Scaling up cosine interesting pattern discovery: A depth-first method
This paper presents an efficient algorithm called CosMinert for interesting pattern discovery. The widely used cosine similarity, found to possess the null-invariance property and the anti-cross-support-pattern property, is adopted as the interestingness measure in CosMinert. CosMinert is generally an FP-growth-like depth-first traversal algorithm that rests on an important property of the cos...
SimDex: Exploiting Model Similarity in Exact Matrix Factorization Recommendations
We present SIMDEX, a new technique for serving exact top-K recommendations on matrix factorization models that measures and optimizes for the similarity between users in the model. Previous serving techniques presume a high degree of similarity (e.g., L2 or cosine distance) among users and/or items in MF models; however, as we demonstrate, the most accurate models are not guaranteed to exhibit ...
Journal title:
- Data Knowl. Eng.
Volume 70, Issue
Pages -
Publication date: 2011